Our study has provided some insights into the current state of verb
subcategorization frame acquisition for biomedicine.  First, using a
system unadapted for biomedicine but applied to a large biomedical
corpus, we achieved acceptable results using a simple relative
frequency threshold similar to previously reported optimum
thresholds. A new method of SCF-specific filtering was found to offer
improved accuracy even though it depended on SCF frequency information
from general language. Still, performance drops off considerably
compared to general language, losing more than 10 points on F-score,
indicating that there is room for adaptation of SCF systems to
biomedicine.
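
To make the two filters concrete, they can be sketched as follows; the notation is introduced here for illustration only and is not taken from either system. Writing $c(v,s)$ for the number of times frame $s$ is hypothesized for verb $v$, a relative frequency filter retains $s$ for $v$ when
\[
  \frac{c(v,s)}{\sum_{s'} c(v,s')} \geq \theta ,
\]
with a single threshold $\theta$ shared by all frames. An SCF-specific filter of the kind described above would replace $\theta$ with a per-frame threshold $\theta_s$, which we assume here to be set from the frequency of $s$ in general language.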

Second, we compared two biomedical SCF lexicons, each representing a
different aspect of the state of the art in SCF acquisition. We found
that the BioLexicon, built with an SCF acquisition system in which
each component has been adapted to biomedical text using manually
annotated data in the molecular biology subdomain, favored precision
over recall when evaluated against our SCF gold standard drawn from
across PMC OA. On the other hand, BioVALEX, built using a state-of-the-art
system for general language SCF acquisition and unadapted to
biomedical text save for the input corpus, favored recall over
precision. Neither type of system is ideal, and the contrast between
the two highlights the need for domain adaptation techniques that can
cover a broader range of subdomains.

Next, we observed that using two different definitions of
subcategorization -- the ``semantic'' definition, which collapses the
argument-adjunct distinction, and the ``syntactic'' definition, which
retains it -- results in very different styles of annotation, and
therefore different evaluation results for an SCF system depending on
the definition used in the gold standard.  Interestingly, because the
Cambridge system readily hypothesizes many phrase types co-occurring
with verbs as part of the SCF, it is more consistent with the semantic
definition of subcategorization and achieved higher accuracy on the
semantic gold standard than the syntactic one. This behavior may or
may not be desirable depending on the application, but needs to be
taken into consideration.

Finally, we found significant variation in SCF behavior between
biomedical subdomains, with properties that differ from those of previously
studied lexical variation.  Most notably, subdomain clusters produced
from the subcategorization behavior of individual verbs did not align
well with clusters based on simple lemma
frequencies \cite{Lippincott:2011}, and often were not readily
interpretable in terms of major subdomain-spanning topics. Some verb
behavior occurred in discrete pockets of just one or two subdomains,
rather than in one of the major clusters identified in lexical
studies.  One factor in this result is that 
% This is because 
we considered individual verbs, whereas lexical studies average 
% the
variation across all lexical items of a given class.  
% Since the SCF distributions are more
% complex than the simple presence/absence of lexical items, we could
% see 
Another potential factor is that distinct senses of a verb, e.g.~general and specialized, may create confounding effects when the SCF behavior of the two senses is overlaid in a subdomain.
% SCF behaviors of a verb
% could be overlayed in a subdomain to produce a new distribution.  This
% phenomenon is likely related to the use of different verb senses,
% e.g.~general and specialized.  
Future work could involve broadening
the set of verbs considered and averaging the divergence in their SCF
distributions to determine whether there is a correlation with the
lexical results.  This would require a principled way of combining the
distributions, beyond simple equal weighting, because the proportion
of verbs that change SCF behavior is small and would be overwhelmed by
noise.
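
As a sketch of what such a combination might look like, with symbols that are assumptions made for illustration: let $P_a(\cdot \mid v)$ be the SCF distribution of verb $v$ in subdomain $a$, and let $d$ be any divergence between distributions, for instance Jensen-Shannon divergence. An aggregate divergence between subdomains $a$ and $b$ over a verb set $V$ could be written
\[
  D(a,b) = \sum_{v \in V} w_v \, d\bigl(P_a(\cdot \mid v),\, P_b(\cdot \mid v)\bigr),
  \qquad \sum_{v \in V} w_v = 1 .
\]
Equal weighting corresponds to $w_v = 1/|V|$; the open question is how to choose $w_v$ so that the few verbs whose behavior changes are not drowned out by the rest.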


Our results suggest several drawbacks to current methods of SCF
acquisition for biomedicine.  First, differences in the definition of
subcategorization, and consequently different SCF inventories and gold
standards, make performance comparison difficult.  Second, the large
performance drop when applying an unadapted system to biomedical text
demonstrates the need for domain-specific adaptation.  Third, the
significant subdomain variation suggests that adaptation based on a
small subset of biomedical text does not address biomedical text in
general.  Current biomedical lexicons rely either directly on manual
annotation, or indirectly on other resources that do.  Because of the
difficulty in producing such resources, it is not feasible to rebuild
them for new subdomains.

Task-based evaluation offers one way around the first of these problems: the output of a system is used to improve performance on a downstream task that is easier to assess \citep{vlachos:2011}.
For example, an unlexicalized parser or relationship extractor could be augmented with SCF
probabilities, and then re-evaluated to determine improvement.  In this setup, the definition of subcategorization and the SCF inventories used by each system would not need to be reconciled: the candidate parses would simply be reranked based on the new probabilities from the lexicon. Some promising results in this direction have already been obtained for general language \cite{Carroll:98}.
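
One possible form of this reranking, given as an illustrative sketch and not as the method of any cited system: if $P_{\mathrm{parse}}(t)$ is the parser's probability for candidate parse $t$, and $s_i$ is the frame, in the lexicon's own inventory, that $t$ implies for verb $v_i$ ($i = 1, \dots, n$), the candidates could be rescored by
\[
  \mathrm{score}(t) = \log P_{\mathrm{parse}}(t) + \lambda \sum_{i=1}^{n} \log P_{\mathrm{lex}}(s_i \mid v_i),
\]
where $P_{\mathrm{lex}}$ comes from the SCF lexicon and the weight $\lambda \geq 0$ is a hypothetical tuning parameter. The quantity evaluated is then downstream parsing accuracy, so the lexicon's inventory is never compared against a gold-standard SCF inventory.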

Decoupling evaluation from a particular definition and inventory would also allow unsupervised methods, such as
clustering and graphical models, to be evaluated alongside supervised and rule-based methods.
Unsupervised approaches have a particular advantage in domain adaptation, since they do not
rely on domain-specific resources, and because their definitions and inventories emerge from
their domain-specific training data.  Ideally, this would also involve moving away from features 
that are domain-sensitive, such as parser output, to shallower and more robust features like 
part-of-speech tags or phrase chunks.  There is a range of semi-supervised methods between these
extremes, such as self-training and hybrid graphical modeling \citep{zhu:2006}.  A potential area for future work is determining an optimal middle ground.

Finally, some of the best results on SCF acquisition for general
language have used information about verb semantic classes to smooth
conditional SCF distributions \citep{korhonen:02}, based on the
linguistic fact that semantically similar verbs tend to have
syntactically similar behavior. This avenue needs further exploration
in biomedicine. Incorporating word-sense disambiguation may also
improve accuracy and understanding of subcategorization in
biomedicine, especially since we see that verb behavior in different
subdomains may involve overlays of general and 
specialized
% biomedical 
senses.
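
A simple way to express such smoothing, offered as an illustration and not as the exact scheme of the cited work, is linear interpolation between a verb's own SCF distribution and that of its semantic class $C(v)$:
\[
  \tilde{P}(s \mid v) = (1 - \alpha)\, P(s \mid v) + \alpha\, P\bigl(s \mid C(v)\bigr),
  \qquad 0 \leq \alpha \leq 1 ,
\]
where the interpolation weight $\alpha$ is an assumed free parameter. With $\alpha = 0$ no smoothing is applied, and larger values pull the verb's estimate toward the behavior typical of its class.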

%  Furthermore, questions of
%gold standard annotation would be less crucial.

%Task-based evaluation is naturally related to more unsupervised methods, such as clustering and graphical models, where latent variables grow to fit the data.  

%This would also allow unsupervised methods, without a
%predetermined SCF inventory, to compete 

%  Furthermore,
%it would eliminate the problems inherent in manually annotating a
%gold standard for such a difficult task.

%A feature common to Biolexicon and the Cambridge system is reliance on parser
%output, namely grammatical relations.  Because of this, decisions
%about argument structure are made prior to SCF extraction (and in
%BioLexicon's case, prior to building an SCF inventory).  A method that uses shallower features, such as parts-of-speech or shallow chunking, would be easier to adapt to new domains.



